DINOv3 Keypoint Segmentation (DinoKPSEG)
Annolid includes an experimental keypoint-centric segmentation model that:
Extracts frozen DINOv3 dense features (ViT patch grid).
Trains a configurable head (
conv,multitask, or unified attentionrelational).Uses Gaussian keypoint heatmaps (or circular masks) as supervision.
Runs inference via the same prediction pipeline used for YOLO, saving results as LabelMe JSON.
Training
This trainer consumes either:
A standard YOLO pose dataset (
data.yamlwithkpt_shapeand labels inlabels/*.txt), orA native LabelMe dataset (per-image
*.jsonnext to images, with point/polygon shapes), orA COCO keypoints dataset (
images+annotations/*.json).
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--epochs 50
Native LabelMe training
Create a small spec YAML that points to a directory (or JSONL index) of LabelMe JSON files:
format: labelme
path: /path/to/dataset_root
train: annotations/train # dir of LabelMe JSONs (images resolved via imagePath/sidecars)
val: annotations/val
kpt_shape: [4, 3] # K keypoints, dims (2 or 3)
keypoint_names: [nose, leftear, rightear, tailbase]
Then train with:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/labelme_spec.yaml \
--data-format labelme \
--epochs 50
Native COCO keypoints training
Create a COCO spec YAML:
format: coco
path: /path/to/dataset_root
image_root: . # optional; defaults to path
train: annotations/train.json
val: annotations/val.json
# Optional overrides:
# kpt_shape: [27, 3]
# keypoint_names: [nose, left_ear, ...]
Then train with:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/coco_spec.yaml \
--data-format coco \
--epochs 50
Annolid will stage the COCO annotations into YOLO-pose labels automatically inside the run directory and train with the same DinoKPSEG pipeline.
One-command aggressive schedule launch config
Generate a ready-to-run training config YAML (useful after dataset split/prep):
python -m annolid.segmentation.dino_kpseg.dataset_tools train-config \
--data /path/to/data_split.yaml \
--output /path/to/configs \
--schedule-profile aggressive_s
You can also audit and precompute directly from COCO specs:
python -m annolid.segmentation.dino_kpseg.dataset_tools audit \
--data /path/to/coco_spec.yaml \
--data-format coco
python -m annolid.segmentation.dino_kpseg.dataset_tools precompute \
--data /path/to/coco_spec.yaml \
--data-format coco \
--model-name facebook/dinov3-vits16-pretrain-lvd1689m \
--layers -2,-1 \
--feature-merge concat
This writes a config like /path/to/configs/train_aggressive_s.yaml. Launch training with:
python -m annolid.segmentation.dino_kpseg.train \
--config /path/to/configs/train_aggressive_s.yaml
CLI flags always override config values, for example:
python -m annolid.segmentation.dino_kpseg.train \
--config /path/to/configs/train_aggressive_s.yaml \
--batch 16 \
--epochs 80
LabelMe conventions:
Keypoints are
shape_type: pointwithlabelmatching an entry inkeypoint_names.Instances are grouped via
group_id(all shapes with the samegroup_idbelong together).Polygons (
shape_type: polygon) are optional; when present they are used to compute per-instance crops in--instance-mode per_instance.
Defaults include Gaussian heatmaps, Dice loss, and a coordinate regression loss. Override as needed:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--mask-type gaussian \
--heatmap-sigma 3 \
--dice-loss-weight 0.5 \
--coord-loss-weight 0.25 \
--coord-loss-type smooth_l1
To train per instance (avoid multi-animal mask unions), use bounding-box crops:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--instance-mode per_instance \
--bbox-scale 1.25
To fuse multiple DINO layers, pass a comma-separated list (features are concatenated):
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--layers -2,-1
You can also control how multi-layer features are merged:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--layers -3,-2,-1 \
--feature-merge mean
Optional trainable channel alignment before the keypoint head:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--layers -3,-2,-1 \
--feature-merge concat \
--feature-align-dim auto
Single-stage Multitask Head (Step 3)
Enable a shared single-stage multitask head (keypoints + objectness + box + instance-mask branches):
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--head-type multitask \
--obj-loss-weight 0.25 \
--box-loss-weight 0.25 \
--inst-loss-weight 0.25
Notes:
Keypoint supervision and decoding stay unchanged.
Auxiliary branches are additive; set weights to
0to disable them.
Training Schedule Hardening (Step 4)
Recommended stability knobs:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--schedule-profile aggressive_s \
--ema \
--ema-decay 0.9995 \
--multitask-aux-warmup-epochs 10
What this adds:
EMA model used for validation/checkpoint export.
Auxiliary multitask losses warmed up over early epochs.
Non-finite loss batches are skipped defensively.
Runtime Safety Mode
For long runs on shared desktops/laptops, enable conservative runtime caps:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--auto-safe-mode
This mode applies safer limits to batch size, image short-side, workers, and torch CPU thread usage, and disables expensive TensorBoard graph/projector exports.
Manual runtime controls are also available:
--workers--max-cpu-threads--max-interop-threads
In the GUI (Train Models -> DINO KPSEG -> Advanced), use:
Auto safe mode (recommended for GUI responsiveness)DataLoader workersMax CPU threadsMax interop threadsAuto-run evaluation report after trainingAuto-report split(supportstest, with optional fallback toval)Auto-report PCK thresholdsGenerate paper artifacts (JSON/MD/CSV/LaTeX)
Equivalent annolid-run training flags are also supported (for example
--schedule-profile, --batch, --cos-lr, --warmup-epochs,
--best-metric, --freeze-bn, and augmentation epoch windows).
Unified Attention Head
DinoKPSEG now uses a single canonical attention architecture (relational) for
attention-based training. It keeps attention queries over keypoints and decodes masks
from query embeddings and refined spatial features via
einsum(mask_embed, spatial_features).
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--head-type relational \
--relational-heads 4 \
--relational-layers 2
Optional symmetric-pair regularizers (requires flip_idx or inferable keypoint names):
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--lr-pair-loss-weight 0.05 \
--lr-pair-margin-px 8
Note:
For YOLO pose datasets with very high left/right ambiguity, DinoKPSEG now auto-disables default LR regularizer weights to avoid destabilizing training. Explicit non-default values still override this behavior.
For high-ambiguity datasets, DinoKPSEG also auto-reduces geometry-heavy augmentation (disables hflip, and for ambiguous small datasets disables rotation/translate/scale) to reduce swap drift.
For HRNet-style training (
--head-type conv), DinoKPSEG follows a SuperAnimal-aligned schedule by default:epochs=210, AdamWlr=5e-4, and MultiStepLR milestones at epochs170and200(gamma=0.1).For very small training sets (
<64images) with HRNet, DinoKPSEG auto-enables BatchNorm running-stat freeze and lowers LR to5e-5unless you explicitly override. Use--freeze-bn/--no-freeze-bnfor manual control.For small/high-ambiguity datasets with orientation anchors (for example
nose+tailbase), DinoKPSEG can auto-canonicalize left/right supervision to reduce persistent identity inversion. Override with--lr-canonicalizeor--no-lr-canonicalize.
Optional left/right side-consistency (uses asymmetric anchors like nose/tailbase to define an axis):
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--lr-side-loss-weight 0.10 \
--lr-side-loss-margin 0.0
Outputs:
A new run directory under
ANNOLID_RUNS_ROOT(or~/annolid_logs/runs) such as:~/annolid_logs/runs/dino_kpseg/train/20260101_120000/weights/best.pt~/annolid_logs/runs/dino_kpseg/train/20260101_120000/weights/last.pt~/annolid_logs/runs/dino_kpseg/train/20260101_120000/args.yaml~/annolid_logs/runs/dino_kpseg/train/20260101_120000/results.csv
To control where runs are written:
python -m annolid.segmentation.dino_kpseg.train \
--data /path/to/YOLO_dataset/data.yaml \
--runs-root /path/to/annolid_logs/runs \
--run-name experiment_01 \
--epochs 50
Inference (CLI)
annolid-run dino_kpseg predict \
--weights /path/to/weights/best.pt \
--image /path/to/frame.png \
--tta-hflip \
--tta-merge max \
--min-keypoint-score 0.20 \
--out /path/to/prediction.json
Notes:
--tta-hflipenables horizontal flip test-time augmentation.--tta-mergecontrols TTA fusion:mean(default) ormax.--min-keypoint-scoredrops low-confidence keypoints from output JSON.
Inference (GUI)
In Inference Wizard with model type DINO KPSEG, the Configure page now exposes:
Enable horizontal flip TTATTA merge(meanormax)Min keypoint score
These settings are persisted in QSettings and are also applied in the standard
GUI prediction workflow (not only wizard runs).
Training Wizard (GUI)
In Training Wizard with backend DINO KPSEG, the Configure page now includes:
Multiple tabs (
Core,Head Type,Loss,Runtime) for clearer navigationHead selection (
relational,conv,multitask) with dedicated per-head detail tabsRelational head controls (
heads,layers)Multitask loss weights (
objectness,box,instance) and aux warmupLoss controls (
threshold,BCE type, focal alpha/gamma)Backbone controls (
short side,layers,hidden dim)
Evaluation
Run evaluation on the train/val split and report mean pixel error, PCK, and left/right swap rate:
python -m annolid.segmentation.dino_kpseg.eval \
--data /path/to/YOLO_dataset/data.yaml \
--weights /path/to/dino_kpseg/weights/best.pt \
--split val \
--thresholds 4,8,16
You can also evaluate a test split (if your dataset YAML includes test):
python -m annolid.segmentation.dino_kpseg.eval \
--data /path/to/YOLO_dataset/data.yaml \
--weights /path/to/dino_kpseg/weights/best.pt \
--split test \
--thresholds 4,8,16
For LabelMe datasets, pass the spec YAML and set --data-format labelme:
python -m annolid.segmentation.dino_kpseg.eval \
--data /path/to/labelme_spec.yaml \
--data-format labelme \
--weights /path/to/dino_kpseg/weights/best.pt \
--split val
For COCO keypoints datasets, pass the COCO spec YAML and set --data-format coco.
Annolid will stage a temporary YOLO-pose view internally before evaluation.
Paper-ready Metric Reports
For paper tables, generate normalized summary artifacts (JSON + Markdown + CSV + LaTeX):
python -m annolid.segmentation.dino_kpseg.eval \
--data /path/to/YOLO_dataset/data.yaml \
--weights /path/to/dino_kpseg/weights/best.pt \
--split test \
--thresholds 4,8,16 \
--paper-report \
--dataset-name "MousePoseBench" \
--model-name "DINO-KPSEG (best.pt)" \
--report-dir /path/to/reports \
--report-basename mouseposebench_test
This writes:
mouseposebench_test.jsonmouseposebench_test.mdmouseposebench_test.csvmouseposebench_test.tex
LabelMe to COCO (GUI)
Use this when your annotations are in LabelMe JSON format and you want a COCO train/valid dataset for training or interoperability.
Open Annolid.
Go to Convert -> LabelMe -> COCO.
In Annotation dir, choose the folder containing LabelMe
*.jsonfiles.(Optional) Set Output dir. If left empty, Annolid uses
<annotation_dir>_coco_dataset.(Optional) Set Labels file (
labels.txt). If empty, labels are auto-detected from shape labels in JSON files.Set Train split (for example
70%).Choose Output mode:
Segmentation: polygon-oriented COCO export (default, backward compatible).Keypoints: strict COCO pose-style export (keypoints+num_keypoints).
Click OK to export.
What the dialog shows:
Number of discovered JSON files.
Number of valid JSON+image pairs.
Estimated train/valid counts based on the selected split.
What Annolid writes:
train/annotations.jsontrain/JPEGImages/*.jpgvalid/annotations.jsonvalid/JPEGImages/*.jpgannotations_train.jsonandannotations_valid.json(root-level convenience copies)data.yaml
In Keypoints mode, each COCO annotation represents one instance and includes:
keypoints(flattened[x, y, v]list),num_keypoints,category-level
keypointsnames incategories[*].keypoints.
Notes:
The exporter expects image files to be resolvable from each LabelMe JSON.
Shapes are exported as COCO segmentations; supported shape types include polygon, rectangle, circle, and point.
If no valid JSON/image pairs are found, export is blocked and the dialog shows an actionable warning.
COCO to LabelMe (GUI)
Use this when you have COCO keypoint annotations and want editable LabelMe JSON files.
Open Annolid.
Go to Convert → COCO to LabelMe.
Select your COCO annotations directory (for example
annotations/).Optionally select an image directory (you can cancel to use paths from the COCO file).
Select an output directory for the generated LabelMe dataset.
What the converter writes:
One image file plus one sidecar LabelMe JSON per COCO image (saved together).
polygonshapes from COCO polygon segmentations (when available).rectangleshapes from COCO bounding boxes (fallback when no polygons exist).pointshapes from COCO keypoints.
Notes:
Keypoints with non-visible flags (
v <= 0) are skipped.group_idis set from COCO annotation id, so polygon/box/keypoints from the same object stay linked.If your COCO JSON is under
annotations/and image files are under siblingimages/, Annolid auto-resolves those paths (you can still set images dir explicitly).imagePathin output JSON is written as the local image filename next to the JSON sidecar.
Running Inference (GUI)
Train the model (or point Annolid to a DinoKPSEG checkpoint).
In Annolid, select AI Model → DINOv3 Keypoint Segmentation.
Click Pred to run prediction.
Optional instance-aware inference: draw rectangle prompts on the canvas (one per animal). Annolid will run DinoKPSEG per box and keep keypoints grouped by instance.
The predictor saves:
Point shapes for each keypoint.
A small polygon “circle mask” per keypoint (for quick visual segmentation).